class: center, middle, inverse, title-slide

# Tidyverse Intro II
### Antoine & Nicolas
### cynkra GmbH
### February 1, 2022

---

<style type="text/css">
.pull-left {
  margin-top: -25px;
}
.pull-right {
  margin-top: -25px;
}
.remark-code {
  font-size: 12px;
}
.font17 {
  font-size: 17px;
}
.font14 {
  font-size: 14px;
}
</style>

# Introduction

Organization of half-day R courses:

- Intro courses:
    * Tidyverse intro I
    * Base R intro/Tidyverse intro II (this course)
    * Data visualization I
    * Data visualization II
- Advanced courses:
    * Advanced tidyverse (this afternoon)
    * R package creation
    * Working with database systems
    * Parallelization & efficient R programming
    * Advanced topics (tbd)

---

# Course material

Our course material is currently available from GitHub at

https://github.com/cynkra/bag-courses

Today we will be looking at the folder `1-2_intro_tidy-ii`

---

# General remarks

- Even though we are starting out remotely, we hope for these courses to be interactive: go ahead and ask if something is unclear!
- You can also write into the chat, which I will try to monitor when Antoine is presenting.
- We were asked to provide recordings of the courses for those of you who cannot join, so recording is activated.
- Per course unit, we offer 4 hours of follow-up time; approach us with questions (nicolas@cynkra.com)!

---

# RStudio Intro

---

# Assignment in R

Assignment means we *bind* a *value* to a *name* in an *environment*.

.pull-left[
```r
a <- FALSE
b <- "a"
c <- 2.3
d <- c(1, 2, 3)
```
]

.pull-right[
<img src="data:image/png;base64,#bindings.png" width="50%" style="display: block; margin: auto;" />
]

The assignment operator `<-` performs this *binding* in the current environment, here the global environment (`.GlobalEnv`).

.pull-left[
```r
0a <- 1
```

```
## Error: unexpected symbol in "0a"
```
]

.pull-right[
```r
if <- 1
```

```
## Error: unexpected assignment in "if <-"
```
]

Not every name is permissible: for example, a name cannot start with a digit or be a reserved word such as `if`.
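---

# Permissible names: a sketch

The two errors above can be explored a little further. This is a minimal sketch, not part of the original course material: the candidate strings and the variable names `candidates` and `is_syntactic` are made up for illustration. It relies on base R's `make.names()`, which rewrites a string into a syntactically valid name, so a string that comes back unchanged was already valid.

```r
# Candidate names (hypothetical examples chosen for illustration)
candidates <- c("a", "0a", "if", "my_var", ".hidden", "my var")

# A name is syntactically valid if make.names() leaves it unchanged
is_syntactic <- make.names(candidates) == candidates
is_syntactic

# Non-syntactic names can still be bound when wrapped in backticks
`0a` <- 1
`0a`
```

Backticks work, but they have to be repeated every time the value is retrieved, so plain syntactic names are usually the more comfortable choice.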
---

# Retrieving values

A value can be accessed via its name (not a string!).

.pull-left[
```r
a
```

```
## [1] FALSE
```
]

.pull-right[
```r
b
```

```
## [1] "a"
```
]

(If it is accessible from the current environment.)

```r
some_crazy_name
```

```
## Error: object 'some_crazy_name' not found
```

Bindings in an environment can be listed using `ls()`.

.pull-left[
```r
ls()
```

```
##  [1] "a"                  "b"                  "c"
##  [4] "d"                  "denominator"        "frac_1"
##  [7] "frac_2"             "frac_fun_gen"       "fraction"
## [10] "half"               "i"                  "is_rstudio_console"
## [13] "l1"                 "m"                  "mat"
## [16] "op_fun"             "q"                  "res"
## [19] "result"             "rndr"               "v1"
## [22] "v2"                 "v3"                 "width"
## [25] "x"                  "x_chr"              "x_int"
## [28] "x_log"              "x_num"              "y"
```
]

.pull-right[
```r
ls(envir = new.env())
```

```
## character(0)
```
]

---

# What objects are accessible?

.pull-left[
```r
ls()
```

```
##  [1] "a"                  "b"                  "c"
##  [4] "d"                  "denominator"        "frac_1"
##  [7] "frac_2"             "frac_fun_gen"       "fraction"
## [10] "half"               "i"                  "is_rstudio_console"
## [13] "l1"                 "m"                  "mat"
## [16] "op_fun"             "q"                  "res"
## [19] "result"             "rndr"               "v1"
## [22] "v2"                 "v3"                 "width"
## [25] "x"                  "x_chr"              "x_int"
## [28] "x_log"              "x_num"              "y"
```
]

.pull-right[
```r
mean
```

```
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x7fc102c205b0>
## <environment: namespace:base>
```
]

.pull-left[
```r
search()
```

```
##  [1] ".GlobalEnv"        "package:readr"
##  [3] "package:stats"     "package:graphics"
##  [5] "package:grDevices" "package:utils"
##  [7] "package:datasets"  "package:colorout"
##  [9] "package:devtools"  "package:usethis"
## [11] "package:methods"   "Autoloads"
## [13] "package:base"
```
]

.pull-right[
```r
library(readr)
search()
```

```
##  [1] ".GlobalEnv"        "package:readr"
##  [3] "package:stats"     "package:graphics"
##  [5] "package:grDevices" "package:utils"
##  [7] "package:datasets"  "package:colorout"
##  [9] "package:devtools"  "package:usethis"
## [11] "package:methods"   "Autoloads"
## [13] "package:base"
```
]

<img src="data:image/png;base64,#search-path.png" width="75%" style="display: block; margin: auto;" />

---

# Base R vector classes I

The function `c()` can be used to combine objects, such as literals.

```r
x_log <- c(TRUE, FALSE)      # same as c(T, F)
x_int <- c(1L, 2L, 3L)       # use 1L to enforce integer, rather than numeric
x_num <- c(1, 2, 6.3, 0.12)  # also called 'double'
x_chr <- c("Hello World")    # or 'Hello World'
```

We can check the type using `class()` and the length with `length()`.

.pull-left[
```r
class(x_log)
```

```
## [1] "logical"
```

```r
class(c(x_int, x_chr))
```

```
## [1] "character"
```
]

.pull-right[
```r
length(x_chr)
```

```
## [1] 1
```

```r
length(c(x_int, x_chr))
```

```
## [1] 4
```
]

There is no type distinction between scalar and vector values!

???

There is a certain order in the list above: `logical` is the least flexible type, while `character` is the most flexible. If you combine vectors of different types, the more flexible class wins.

---

# Base R vector classes II

The class of a vector can safely be changed to a more "general" type.
.pull-left[
```r
as.logical(c(1, 0, 2))
```

```
## [1]  TRUE FALSE  TRUE
```

```r
as.integer(c(TRUE, FALSE, TRUE))
```

```
## [1] 1 0 1
```
]

.pull-right[
```r
as.numeric(c("1", "2"))
```

```
## [1] 1 2
```

```r
as.character(c(TRUE, FALSE))
```

```
## [1] "TRUE"  "FALSE"
```
]

A change to a more "specific" type is also possible, but values that cannot be converted become `NA`.

```r
as.numeric(c("hi", "number", "1"))
```

```
## Warning: NAs introduced by coercion
```

```
## [1] NA NA  1
```

???

These functions will always work if you coerce towards greater flexibility. If you want to go the other way, it may give you `NA`s and some warnings.

---

# Sequences & repetitions

Often we need to create vectors with patterns, such as sequences

.pull-left[
```r
1:5
```

```
## [1] 1 2 3 4 5
```

```r
seq(1, 5)
```

```
## [1] 1 2 3 4 5
```
]

.pull-right[
```r
seq(3, 9, by = 2)
```

```
## [1] 3 5 7 9
```

```r
seq(0, 1, length.out = 5)
```

```
## [1] 0.00 0.25 0.50 0.75 1.00
```
]

or repetitions

.pull-left[
```r
rep(5, 4)
```

```
## [1] 5 5 5 5
```

```r
rep("hello", 2)
```

```
## [1] "hello" "hello"
```
]

.pull-right[
```r
rep(1:4, 2)
```

```
## [1] 1 2 3 4 1 2 3 4
```

```r
rep(1:4, each = 2)
```

```
## [1] 1 1 2 2 3 3 4 4
```
]

---

# Basic arithmetic

Operators `+`, `-`, `*`, `/`, etc. are implemented as functions

.pull-left[
```r
2 + 3
```

```
## [1] 5
```
]

.pull-right[
```r
`+`(2, 3)
```

```
## [1] 5
```
]

Operations are vectorized (element-wise)

.pull-left[
```r
x <- c(1, 2, 4)
x + c(5, 0, -1)
```

```
## [1] 6 2 3
```

```r
1:5 * 2
```

```
## [1]  2  4  6  8 10
```
]

.pull-right[
```r
x <- c(1, 2, 4)
x * c(5, 0, -1)
```

```
## [1]  5  0 -4
```

```r
1:5 * rep(2, 5)
```

```
## [1]  2  4  6  8 10
```
]

---

# Recycling

In case of length mismatch, the shorter vector is recycled.
.pull-left[
```r
c(1, 2) + c(6, 0, 9, 20, 22, 11)
```

```
## [1]  7  2 10 22 23 13
```
]

.pull-right[
```r
c(1, 2, 1, 2, 1, 2) + c(6, 0, 9, 20, 22, 11)
```

```
## [1]  7  2 10 22 23 13
```
]

```r
c(1, 2, 3, 4) + c(6, 0, 9, 20, 22, 11)
```

```
## Warning in c(1, 2, 3, 4) + c(6, 0, 9, 20, 22, 11): longer object length is not a
## multiple of shorter object length
```

```
## [1]  7  2 12 24 23 13
```

Advice: in general, try to avoid recycling anything other than length-1 vectors.

---

# Comparison operators

.pull-left[
```r
x <- c(1, 2, 4, 2)
```
]

.pull-right[
```r
y <- c(2, 2, 4, 5)
```
]

Inequality: `<`, `>`, `<=`, `>=`

.pull-left[
```r
x < 2
```

```
## [1]  TRUE FALSE FALSE FALSE
```

```r
x < y
```

```
## [1]  TRUE FALSE FALSE  TRUE
```
]

.pull-right[
```r
x <= 2
```

```
## [1]  TRUE  TRUE FALSE  TRUE
```

```r
x <= y
```

```
## [1] TRUE TRUE TRUE TRUE
```
]

Equality (and its negation): `==`, `!=`

.pull-left[
```r
x == y
```

```
## [1] FALSE  TRUE  TRUE FALSE
```
]

.pull-right[
```r
x != y
```

```
## [1]  TRUE FALSE FALSE  TRUE
```
]

---

# Logical operators

.pull-left[
```r
a <- x >= 2
a
```

```
## [1] FALSE  TRUE  TRUE  TRUE
```
]

.pull-right[
```r
b <- x < 4
b
```

```
## [1]  TRUE  TRUE FALSE  TRUE
```
]

Boolean operators: `&` (AND), `|` (OR) and `!` (NOT)

.pull-left[
```r
a & b
```

```
## [1] FALSE  TRUE FALSE  TRUE
```

```r
a | b
```

```
## [1] TRUE TRUE TRUE TRUE
```
]

.pull-right[
```r
!a & b
```

```
## [1]  TRUE FALSE FALSE FALSE
```

```r
!(a & b)
```

```
## [1]  TRUE FALSE  TRUE FALSE
```
]

We also have `&&` and `||` (not discussed here)

---

# Numeric indexing

.pull-left[
```r
x <- c(1.2, 3.9, 0.4, 0.12)
```
]

.pull-right[
```r
i <- 3:4
```
]

We can extract values from a vector using a numeric index.
.pull-left[
```r
x[c(1, 3)]
```

```
## [1] 1.2 0.4
```

```r
x[c(1, 1, 3)]
```

```
## [1] 1.2 1.2 0.4
```

```r
x[-1]
```

```
## [1] 3.90 0.40 0.12
```
]

.pull-right[
```r
x[2:3]
```

```
## [1] 3.9 0.4
```

```r
x[i]
```

```
## [1] 0.40 0.12
```

```r
x[-i]
```

```
## [1] 1.2 3.9
```
]

---

# Logical indexing

.pull-left[
```r
x <- c(1.2, 3.9, 0.4, 0.12)
x
```

```
## [1] 1.20 3.90 0.40 0.12
```
]

.pull-right[
```r
(i <- rep(c(TRUE, FALSE), each = 2))
```

```
## [1]  TRUE  TRUE FALSE FALSE
```
]

We can extract values from a vector using a logical index.

.pull-left[
```r
x[i]
```

```
## [1] 1.2 3.9
```

```r
x[!i]
```

```
## [1] 0.40 0.12
```

```r
x[x > 2]
```

```
## [1] 3.9
```
]

.pull-right[
```r
x[TRUE]
```

```
## [1] 1.20 3.90 0.40 0.12
```

```r
x[FALSE]
```

```
## numeric(0)
```

```r
x[-i]
```

```
## [1] 3.90 0.40 0.12
```
]

???

Last example: complete craziness! `-i` evaluates to `c(-1, -1, 0, 0)`: the negative indices drop the first element and the zeros select nothing.

---

# Subset assignment

Combines a subsetting operation with an assignment.

```r
y <- c(1.2, 3.9, 0.4, 0.12)

y[c(2, 4)] <- 5
y[c(FALSE, TRUE, FALSE, TRUE)] <- 5
y
```

```
## [1] 1.2 5.0 0.4 5.0
```

```r
y[y > 2] <- 2
y
```

```
## [1] 1.2 2.0 0.4 2.0
```

```r
y[y == 2] <- c(4, 5)
y
```

```
## [1] 1.2 4.0 0.4 5.0
```

---

# Exercises (Homework)

1. Create a vector called `v1` containing the numbers 2, 5, 8, 12 and 16.
1. Extract the values at positions 2 and 5 from `v1`.
1. Use `x:y` notation to make a second vector called `v2` containing the numbers 5 to 9.
1. Subtract `v2` from `v1` and look at the result.
1. Generate a vector with 1000 standard-normally distributed random numbers (use `rnorm()`). Store the result as `v3`. Extract the numbers that are bigger than 2.

---

# Solutions I

1. Create a vector called `v1` containing the numbers 2, 5, 8, 12 and 16.

    ```r
    v1 <- c(2, 5, 8, 12, 16)
    ```

1. Extract the values at positions 2 and 5 from `v1`.

    ```r
    v1[c(2, 5)]
    ```

    ```
    ## [1]  5 16
    ```

1. Use `x:y` notation to make a second vector called `v2` containing the numbers 5 to 9.

    ```r
    v2 <- 5:9
    ```

---

# Solutions II

1. Subtract `v2` from `v1` and look at the result.

    ```r
    v1 - v2
    ```

    ```
    ## [1] -3 -1  1  4  7
    ```

1. Generate a vector with 1000 standard-normally distributed random numbers (use `rnorm()`). Store the result as `v3`. Extract the numbers that are bigger than 2.

    ```r
    v3 <- rnorm(1000)
    v3[v3 > 2]
    ```

    ```
    ##  [1] 3.815906 2.972135 2.099225 2.015556 2.234020 2.518990 2.148066 2.987901
    ##  [9] 2.036843 2.067264 2.633355 2.241635 2.037483 2.169295 2.265483 2.248635
    ## [17] 2.544779 2.041437
    ```

---

# Matrices

Internally represented by a column-major vector, with dimensions.

.pull-left[
```r
(m <- matrix(c(1, 4, 2, 2, 7, 3), nrow = 2))
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    7
## [2,]    4    2    3
```
]

.pull-right[
```r
dim(m)
```

```
## [1] 2 3
```
]

Two indexes are required for subset selection.

.pull-left[
```r
m[1, 2]
```

```
## [1] 2
```

```r
m[, c(2, 3)]
```

```
##      [,1] [,2]
## [1,]    2    7
## [2,]    2    3
```
]

.pull-right[
```r
m[1, ]
```

```
## [1] 1 2 7
```

```r
m[1, , drop = FALSE]
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    7
```
]

---

# Exercises (Homework)

1. Create a 10 x 10 matrix that contains a sequence of numbers (use the `:` notation).
1. Extract the 2nd column of the matrix.
1. Extract the 5th row of the matrix.
1. Extract the 5th and the 6th row of the matrix.
1. Compare the classes of the results of the previous two subsetting operations.
1. Modify the solution to exercise 3 so that it returns the same class as the solution to exercise 4.

---

# Solutions I

1. Create a 10 x 10 matrix that contains a sequence of numbers (use the `:` notation).

    ```r
    mat <- matrix(1:100, ncol = 10)
    ```

1. Extract the 2nd column of the matrix.

    ```r
    mat[, 2]
    ```

    ```
    ##  [1] 11 12 13 14 15 16 17 18 19 20
    ```

1. Extract the 5th row of the matrix.

    ```r
    v1 <- mat[5, ]
    ```

---

# Solutions II

1. Extract the 5th and the 6th row of the matrix.

    ```r
    v2 <- mat[c(5, 6), ]
    ```

1. Compare the classes of the results of the previous two subsetting operations.

    ```r
    class(v1)
    ```

    ```
    ## [1] "integer"
    ```

    ```r
    class(v2)
    ```

    ```
    ## [1] "matrix" "array"
    ```

---

# Solutions III

1. Modify the solution to exercise 3 so that it returns the same class as the solution to exercise 4.

    ```r
    (v3 <- mat[5, , drop = FALSE])
    ```

    ```
    ##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    ## [1,]    5   15   25   35   45   55   65   75   85    95
    ```

    ```r
    class(v3)
    ```

    ```
    ## [1] "matrix" "array"
    ```

---

# Lists

.pull-left[
```r
(x <- list(u = c(2, 3, 4), v = "abc"))
```

```
## $u
## [1] 2 3 4
## 
## $v
## [1] "abc"
```
]

.pull-right[
```r
length(x)
```

```
## [1] 2
```
]

Subsetting can be done with `$`, `[` or `[[`

.pull-left[
```r
x$u
```

```
## [1] 2 3 4
```

```r
x[["u"]]
```

```
## [1] 2 3 4
```

```r
x[[1]]
```

```
## [1] 2 3 4
```
]

.pull-right[
```r
x["u"]
```

```
## $u
## [1] 2 3 4
```

```r
x[1:2]
```

```
## $u
## [1] 2 3 4
## 
## $v
## [1] "abc"
```
]

---

# List subsetting mnemonic

<img src="data:image/png;base64,#lists.png" width="75%" style="display: block; margin: auto;" />

.pull-left[
```r
(x <- list(list(1:3), list(4:6)))
```

```
## [[1]]
## [[1]][[1]]
## [1] 1 2 3
## 
## 
## [[2]]
## [[2]][[1]]
## [1] 4 5 6
```
]

.pull-right[
```r
x[[1]][[1]]
```

```
## [1] 1 2 3
```

```r
x[[1]][[1]][1]
```

```
## [1] 1
```
]

---

# Names

.pull-left[
```r
(x <- list(u = c(2, 3, 4), v = "abc"))
```

```
## $u
## [1] 2 3 4
## 
## $v
## [1] "abc"
```
]

.pull-right[
```r
names(x)
```

```
## [1] "u" "v"
```
]

Any vector in R can have a names attribute.
.pull-left[
```r
(y <- c(a = 1, b = 2, c = 3))
```

```
## a b c 
## 1 2 3
```

```r
class(y)
```

```
## [1] "numeric"
```
]

.pull-right[
```r
y[c("a", "b")]
```

```
## a b 
## 1 2
```
]

---

# Data frames

Two-dimensional like matrices, but implemented using lists

```r
(d <- data.frame(kids = c("Jack", "Jill", "Jamie"), ages = c(12, 10, 7)))
```

```
##    kids ages
## 1  Jack   12
## 2  Jill   10
## 3 Jamie    7
```

.pull-left[
```r
d[["ages"]] # same as d$ages
```

```
## [1] 12 10  7
```

```r
d[1, ]
```

```
##   kids ages
## 1 Jack   12
```

```r
d[, 1]
```

```
## [1] "Jack"  "Jill"  "Jamie"
```
]

.pull-right[
```r
length(d) # same as ncol(d)
```

```
## [1] 2
```

```r
nrow(d)
```

```
## [1] 3
```

```r
dim(d)
```

```
## [1] 3 2
```
]

---

# Exercises (Homework)

1. Generate two random vectors of length 10 (for example using `runif()`), `a`, and `b`. Combine them in a list, call it `l1`.
1. Compare the classes of `l1[2]` and `l1[[2]]`. Can you explain the difference?
1. How many rows does the data frame `mtcars` contain? The dataset is available by default. Just try typing `mtcars`.
1. Of what type is the column `vs` of `mtcars`?
1. Try printing the column names of `mtcars`.

---

# Solutions I

1. Generate two random vectors of length 10 (for example using `runif()`), `a`, and `b`. Combine them in a list, call it `l1`.

    ```r
    a <- runif(10)
    b <- runif(10)
    l1 <- list(a, b)
    ```

1. Compare the classes of `l1[2]` and `l1[[2]]`. Can you explain the difference?

.pull-left[
```r
class(l1[2])
```

```
## [1] "list"
```
]

.pull-right[
```r
class(l1[[2]])
```

```
## [1] "numeric"
```
]

Subsetting a list using `[` returns a list, whereas `[[` returns the object contained in the list at the given position.

---

# Solutions II

1. How many rows does the data frame `mtcars` contain? The dataset is available by default. Just try typing `mtcars`.

    ```r
    nrow(mtcars)
    ```

    ```
    ## [1] 32
    ```

1. Of what type is the column `vs` of `mtcars`?

    ```r
    class(mtcars$vs)
    ```

    ```
    ## [1] "numeric"
    ```

1. Try printing the column names of `mtcars`.

    ```r
    colnames(mtcars)
    ```

    ```
    ##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
    ## [11] "carb"
    ```
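---

# Putting it together: a sketch

The pieces from this course (lists, names, logical indexing, data frames) combine naturally on `mtcars`. This is a minimal sketch rather than part of the exercises. The variable names `col_classes` and `straight` are made up for illustration, and `sapply()` is a base R function that was not covered above: it applies a function to every element of a list, here to every column.

```r
# A data frame is a list of equal-length columns
is.list(mtcars)

# Class of every column at once; the result is a named character vector
col_classes <- sapply(mtcars, class)
col_classes[["vs"]]

# Logical row index combined with a character column index
straight <- mtcars[mtcars$vs == 1, c("mpg", "vs")]
nrow(straight)
```

The row index `mtcars$vs == 1` is an ordinary logical vector, so everything from the logical indexing slides applies unchanged.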